[None][feat] Support request-scoped capacity-only KV cache compaction by Hudayday · Pull Request #15697 · NVIDIA/TensorRT-LLM

Hudayday · 2026-06-28T16:33:47Z

Description

KV-cache compression can physically compact a request's live KV into a
smaller dense prefix. During generation, the V2 runtime adapter normally
passes the request's monotonically growing logical token count as
history_length. That behavior is correct for ordinary full attention, but it
causes a compacted request to grow back toward its logical length and prevents
the reclaimed pages from remaining available to the KV pool.

The V2 core already supports the required operation: shrink capacity while
preserving the existing committed history with
resize(capacity, history_length=None).

This PR adds only the request-scoped runtime-adapter plumbing needed by
KV-cache compression. It does not change the V2 core and does not add a
fork/rewind or public manager API.

Changes

- Preserve the existing path unless a request explicitly sets
  py_kv_cache_generation_capacity_only=True.
- For an opted-in generation request, pass history_length=None, preserving
  the core's current committed history while allowing physical capacity to
  shrink.
- Consume an optional
  (target_capacity, published_capacity, event) compaction marker.
- Wait for the producer CUDA event before reclaimed pages can be reused.
- Preserve capacity growth that occurred after marker publication and apply
  the current rewind:
  target + (live_capacity - published_capacity) - rewind.
- Clear the compaction marker only after resize() succeeds so a failed resize
  can be retried.
- Add focused tests and register them in the A10 pre-merge test list.
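
The capacity arithmetic and the "clear only after success" rule above can be illustrated with a minimal sketch. This is not the PR's code: `CompactionMarker`, `FakeKVCache`, `compacted_capacity`, and `apply_marker` are hypothetical names, and the sketch assumes that resize() reports success as a boolean. The CUDA event wait is only indicated by a comment.

```python
from typing import NamedTuple


class CompactionMarker(NamedTuple):
    """Hypothetical stand-in for the (target_capacity, published_capacity, event) marker."""
    target_capacity: int
    published_capacity: int
    event: object  # placeholder for the producer CUDA event; not used in this sketch


def compacted_capacity(marker: CompactionMarker, live_capacity: int, rewind: int) -> int:
    """target + (live_capacity - published_capacity) - rewind, as stated in the list above."""
    growth_since_publish = live_capacity - marker.published_capacity
    return marker.target_capacity + growth_since_publish - rewind


class FakeKVCache:
    """Hypothetical cache; assumes resize() returns True on success, False otherwise."""

    def __init__(self, capacity: int, succeed: bool = True):
        self.capacity = capacity
        self.succeed = succeed

    def resize(self, capacity, history_length=None) -> bool:
        if self.succeed:
            self.capacity = capacity
        return self.succeed


def apply_marker(state: dict, cache: FakeKVCache, rewind: int) -> bool:
    """Apply a pending marker; keep it in place if resize() fails so it can be retried."""
    marker = state.get("marker")
    if marker is None:
        return True
    # A real adapter would wait on marker.event here, before reclaimed pages are reused.
    new_capacity = compacted_capacity(marker, cache.capacity, rewind)
    ok = cache.resize(new_capacity, history_length=None)
    if ok:
        state["marker"] = None  # cleared only after a successful resize
    return ok
```

For example, a marker published at capacity 256 with target 64, applied when the live capacity has grown to 260 and the rewind is 1, yields 64 + (260 - 256) - 1 = 67.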

Compatibility

Requests that do not explicitly opt in call resize() with the same capacity
and history arguments as before. The opt-in check is request scoped and
fail-closed.
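
A minimal sketch of what a request-scoped, fail-closed opt-in check could look like. The helper names `is_capacity_only` and `history_length_for` are hypothetical, and the strict identity comparison with True is an assumption about what "explicitly sets" means; the PR's actual check may differ.

```python
from types import SimpleNamespace


def is_capacity_only(request) -> bool:
    """Fail closed: a missing, None, or merely truthy attribute does not opt in."""
    return getattr(request, "py_kv_cache_generation_capacity_only", False) is True


def history_length_for(request, logical_token_count: int):
    """Return None (keep committed history) only for opted-in requests."""
    return None if is_capacity_only(request) else logical_token_count


# Requests without the flag keep passing the logical token count as before.
plain = SimpleNamespace()
opted_in = SimpleNamespace(py_kv_cache_generation_capacity_only=True)
```

With this shape, `history_length_for(plain, 128)` returns 128 and `history_length_for(opted_in, 128)` returns None.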

There is no public API change, no V2 core change, and no manager-level
compression state. The compression implementation owns publishing the
request marker and clearing its generation capacity-only flag at request
completion.

Validation

python3 -m compileall: passed
git diff --check: passed
ruff check: passed
ruff format --check: passed
Focused V2 unit test: 5 passed
Integrated TriAttention/V2 focused suite: 71 passed
GPT-OSS 20B/120B dense and union smoke: 4/4 passed

Hudayday · 2026-06-28T16:34:46Z

/bot run --disable-fail-fast

coderabbitai · 2026-06-28T16:39:15Z

Walkthrough

KVCacheManagerV2.update_resources gains a compression reclaim branch: when py_kv_evicted_tokens > 0 and the request is not completing, it calls kv_cache.fork(max_beam - evicted), falling back to resize(None, max_beam - evicted) on exception. A new unit test file covers all five dispatch scenarios and is added to the A10 pre-merge test list.

Changes

KV-cache compression reclaim

Layer / File(s)	Summary
Compression reclaim branch in `update_resources` `tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`	Adds `completing` flag and `py_kv_evicted_tokens` check; routes evicted non-completing requests through `kv_cache.fork(max_beam - evicted)` with exception-triggered fallback to `resize(None, max_beam - evicted)`. Non-evicted path retains prior `new_capacity` assignment logic.
Unit tests and CI registration `tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py`, `tests/integration/test_lists/test-db/l0_a10.yml`	New test module with `_fake_manager` and mock helpers covers five branches: fork-on-eviction, original resize on unevicted, `None`-capacity on completing, fork-failure fallback, and inactive-cache skip. File added to A10 pre-merge list.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

api-compatible

Suggested reviewers

byshiue
heyuhhh
lfr-0531
Superjomn
PerkzZheng

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly matches the main change: request-scoped KV cache compaction and capacity-only handling.
Description check	✅ Passed	The PR description is detailed and relevant, with clear problem, changes, compatibility, and validation sections, though it omits the template's Test Coverage section.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`:
- Around line 2497-2502: The no-free fallback in
kv_cache_manager_v2._KVCacheManagerV2 should not ignore the boolean result from
kv_cache.resize(None, req.max_beam_num_tokens - evicted). After the warning in
the fallback branch, check the return value just like the normal resize path and
treat a failed resize as fatal or otherwise handle it consistently so the
request state stays aligned with the live _KVCache state.

In `@tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py`:
- Around line 77-83: The current test coverage only exercises completion with
evicted=0, so the combined evicted-and-completing path in the reclaim logic is
still untested. Add a test in test_kv_cache_v2_compression_reclaim.py using
_fake_manager, _run, and _req with evicted > 0 and a completing state such as
LlmRequestState.GENERATION_COMPLETE or CONTEXT_INIT, and assert that the request
still calls resize(None, max_beam - 1) while fork() is not called. This should
verify the guard in the kv cache reclaim behavior when both conditions are
present.
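
The two findings above appear to target the earlier revision described in the walkthrough, which still used fork(). As a rough illustration only, the sketch below reconstructs that branch from the walkthrough and review text, not from the PR's code: `reclaim` is a hypothetical function, and raising RuntimeError on a failed resize is one possible way to "treat a failed resize as fatal". The non-evicted, non-completing path is omitted.

```python
from unittest.mock import MagicMock


def reclaim(kv_cache, max_beam_num_tokens: int, evicted: int, completing: bool) -> None:
    """Hypothetical reclaim step reconstructed from the review text."""
    if completing:
        # Completing requests never fork, even when tokens were evicted.
        if not kv_cache.resize(None, max_beam_num_tokens - 1):
            raise RuntimeError("resize failed for completing request")
        return
    if evicted > 0:
        target = max_beam_num_tokens - evicted
        try:
            kv_cache.fork(target)
        except Exception:
            # First finding: do not ignore the boolean result of the fallback resize.
            if not kv_cache.resize(None, target):
                raise RuntimeError("fallback resize failed after fork error")


def check_evicted_and_completing_skips_fork() -> None:
    """Second finding: evicted > 0 and completing must still take the resize path."""
    cache = MagicMock()
    cache.resize.return_value = True
    reclaim(cache, max_beam_num_tokens=10, evicted=3, completing=True)
    cache.resize.assert_called_once_with(None, 9)
    cache.fork.assert_not_called()


check_evicted_and_completing_skips_fork()
```

A real test would go through the module's own `_fake_manager`, `_run`, and `_req` helpers rather than a bare MagicMock.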

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f47017d9-645a-4086-ab61-1067b72e308b

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec0c84 and c1c180c.

📒 Files selected for processing (3)

tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py
tests/integration/test_lists/test-db/l0_a10.yml
tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py

tensorrt-cicd · 2026-06-28T16:40:20Z

PR_Github #56237 [ run ] triggered by Bot. Commit: c1c180c Link to invocation

tensorrt-cicd · 2026-06-28T20:31:26Z

PR_Github #56237 [ run ] completed with state SUCCESS. Commit: c1c180c
/LLM/main/L0_MergeRequest_PR pipeline #45098 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-06-29T01:17:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T01:23:09Z

PR_Github #56247 [ run ] triggered by Bot. Commit: c1c180c Link to invocation

tensorrt-cicd · 2026-06-29T03:25:04Z

PR_Github #56247 [ run ] completed with state SUCCESS. Commit: c1c180c
/LLM/main/L0_MergeRequest_PR pipeline #45107 completed with status: 'SUCCESS'

CI Report

Link to invocation

Hudayday · 2026-06-29T03:33:59Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T03:40:01Z

PR_Github #56292 [ run ] triggered by Bot. Commit: 095f9ff Link to invocation

Hudayday · 2026-06-29T05:31:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T05:39:52Z

PR_Github #56307 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

tensorrt-cicd · 2026-06-29T05:44:01Z

PR_Github #56292 [ run ] completed with state ABORTED. Commit: 095f9ff

Link to invocation

tensorrt-cicd · 2026-06-29T08:44:11Z

PR_Github #56307 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45158 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-06-29T09:12:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T09:17:51Z

PR_Github #56337 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

tensorrt-cicd · 2026-06-29T11:14:57Z

PR_Github #56337 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-06-29T12:48:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T12:54:12Z

PR_Github #56371 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

tensorrt-cicd · 2026-06-29T14:53:24Z

PR_Github #56371 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45216 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-06-30T01:30:11Z

/bot run --disable-fail-fast

Hudayday · 2026-06-30T02:42:22Z

/bot run --disable-fail-fast

Hudayday · 2026-06-30T04:37:35Z

/bot run --disable-fail-fast

Hudayday · 2026-06-30T06:23:21Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-30T06:28:45Z

PR_Github #56516 [ run ] triggered by Bot. Commit: 9f3148e Link to invocation

tensorrt-cicd · 2026-06-30T11:58:56Z

PR_Github #56516 [ run ] completed with state SUCCESS. Commit: 9f3148e
/LLM/main/L0_MergeRequest_PR pipeline #45354 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-06-30T12:24:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-30T12:29:50Z

PR_Github #56605 [ run ] triggered by Bot. Commit: 9f3148e Link to invocation

tensorrt-cicd · 2026-06-30T14:29:35Z

PR_Github #56605 [ run ] completed with state SUCCESS. Commit: 9f3148e
/LLM/main/L0_MergeRequest_PR pipeline #45432 completed with status: 'SUCCESS'

CI Report

Link to invocation

nvpohanh · 2026-07-01T07:43:18Z

[by Codex] @lowsfer Could you review this PR? Thanks!

Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>

Hudayday · 2026-07-02T10:06:44Z

/bot run --disable-fail-fast

Hudayday · 2026-07-02T12:06:49Z

/bot run --disable-fail-fast

Hudayday · 2026-07-02T14:23:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-02T14:29:34Z

PR_Github #57196 [ run ] triggered by Bot. Commit: 5ba3c17 Link to invocation

tensorrt-cicd · 2026-07-02T19:04:59Z

PR_Github #57196 [ run ] completed with state SUCCESS. Commit: 5ba3c17
/LLM/main/L0_MergeRequest_PR pipeline #45968 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Hudayday · 2026-07-03T03:19:09Z

/bot run --disable-fail-fast

Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>

Hudayday · 2026-07-03T04:51:21Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T04:56:44Z

PR_Github #57359 [ run ] triggered by Bot. Commit: 2607a43 Link to invocation

Hudayday requested a review from a team as a code owner June 28, 2026 16:33

github-actions Bot assigned Hudayday Jun 28, 2026

Hudayday requested a review from lowsfer June 28, 2026 16:33

coderabbitai Bot reviewed Jun 28, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py Outdated

Comment thread tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py Outdated

Hudayday marked this pull request as draft June 29, 2026 02:32

Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from 5cdbbf3 to 095f9ff Compare June 29, 2026 03:32

Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from 2f10293 to c11ca19 Compare June 29, 2026 05:19

Hudayday marked this pull request as ready for review June 29, 2026 05:21

Hudayday marked this pull request as draft July 2, 2026 03:07

Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from ef05e64 to b053331 Compare July 2, 2026 09:28

Hudayday changed the title ~~[None][feat] Support non-growing KV history in KVCacheManagerV2 for KV-cache compression~~ [None][feat] Support request-scoped capacity-only KV cache compaction Jul 2, 2026

Hudayday changed the title ~~[None][feat] Support request-scoped capacity-only KV cache compaction~~ [None][feat] Support request-scoped capacity-only KV cache compaction Jul 2, 2026

[None][feat] Support capacity-only KV cache compaction

5ba3c17

Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>

Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch from b053331 to 5ba3c17 Compare July 2, 2026 09:54

Hudayday marked this pull request as ready for review July 2, 2026 09:58

lowsfer requested a review from yizhang-nv July 2, 2026 13:30

Merge branch 'main' into kvcache-v2-nongrowing-reclaim

2607a43

Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>
